Abstract: Over 3,000 genome-wide association studies (GWAS) across a decade of research have discovered genetic risk variants and biological pathways that play a role in specific diseases, disorders and behavioural traits. The NHGRI-EBI GWAS Catalog indexes all published studies; to this we construct, harmonise and merge various data sources and conduct multiple detailed analyses in order to provide the first systematic, data-driven examination of GWAS to date. We pursue multiple distinct avenues of interest: participants, funders, researchers, traits, cohorts and consortia. We first replicate and substantially expand existing analysis with regard to participant ancestry over time, adding further evidence of bias towards recruitment from European and American countries. We then provide an analysis of who funds GWAS (and what they fund): over 85% of acknowledgements relate to NIH grants, with almost the entirety of the remainder relating to UK agencies. A network analysis of GWAS authors shows a dense group of Principal Investigators contributing large datasets, reflected in centrality measures and an original ‘GWAS H-Index’. Approximately 40% of authors are female, although men dominate the typical ‘senior author’ position (71%). We also find a gendered division across the studied traits, where the dominant categories relate to neurological disorders, neoplasms and drug responses. An extensive manual data collection exercise summarises the most frequently utilised cohorts, for which we collate a range of summary descriptives, and we also provide an overview of the main consortia in play. In light of these findings, we discuss challenges and solutions for the future of genomics.
This is an IPython Notebook to accompany the paper entitled 'The Architecture of Genome Wide Association Studies' by Melinda Mills and Charles Rahal. It allows us to share the data analysis done in the construction of this data-driven review of all GWAS to date, and it can also be used to dynamically replicate the analysis herein going forward. For installation guidelines, please see the readme.md that accompanies this repository/clone/zip. A preliminary point to note: if you wish to run this .ipynb file itself, you may need to override the default settings of some versions of Jupyter Notebook (4.2 to 5.1?) by opening with:
jupyter notebook --NotebookApp.iopub_data_rate_limit=10000000000
or by editing your jupyter_notebook_config.py file.
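For the config-file route, a line like the following should achieve the same (the limit value mirrors the command above):

c.NotebookApp.iopub_data_rate_limit = 10000000000

Let's begin by loading in our favourite Python modules, IPython magics and a bunch of custom functions we've written specifically for this project: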
import networkx as nx
import os
import pandas as pd
import re
import requests
import seaborn as sns
import sys
import numpy as np
import csv
import itertools
import warnings
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import matplotlib as mpl
from mpl_toolkits.axes_grid1.inset_locator import inset_axes
from matplotlib.patches import Polygon
from matplotlib.collections import PatchCollection
from mpl_toolkits.basemap import Basemap
from matplotlib.colors import Normalize
from matplotlib import rc
import gender_guesser.detector as gender
from Bio import Entrez
from PIL import Image
import requests_ftp
from wordcloud import WordCloud, STOPWORDS
from IPython.display import HTML, display
from Support.LoadData import (
EFO_parent_mapper, load_gwas_cat, load_pubmed_data,
make_timely, make_clean_CoR
)
from Support.PubMed import (
build_collective, build_author, build_funder,
build_abstract, build_citation
)
from Support.Ancestry import ancestry_cleaner
from Support.Additional import (grey_color_func, clean_names)
warnings.filterwarnings('ignore')
mpl.rcParams.update(mpl.rcParamsDefault)
%config InlineBackend.figure_format = 'png'
%matplotlib inline
%load_ext autoreload
%autoreload 2
Then, let's dynamically grab the three main curated datasets that we will need for our endeavours from the EBI GWAS Catalog website ('All Associations', 'All Studies', 'All Ancestries'), plus the EFO-to-Parent trait mapping file from their FTP server:
#r = requests.get(
# 'https://www.ebi.ac.uk/gwas/api/search/downloads/studies_alternative')
#with open(os.path.abspath(
# os.path.join('__file__', '../..', 'Data', 'Catalogue', 'Raw',
# 'Cat_Stud.tsv')), 'wb') as tsvfile:
# tsvfile.write(r.content)
#r = requests.get(
# 'https://www.ebi.ac.uk/gwas/api/search/downloads/ancestry')
#with open(os.path.abspath(
# os.path.join('__file__', '../..', 'Data', 'Catalogue', 'Raw',
# 'Cat_Anc.tsv')), 'wb') as tsvfile:
# tsvfile.write(r.content)
#r = requests.get('https://www.ebi.ac.uk/gwas/api/search/downloads/full')
#with open(os.path.abspath(
# os.path.join('__file__', '../..', 'Data', 'Catalogue', 'Raw',
# 'Cat_Full.tsv')), 'wb') as tsvfile:
# tsvfile.write(r.content)
#requests_ftp.monkeypatch_session()
#s = requests.Session()
#r = s.get('ftp://ftp.ebi.ac.uk/pub/databases/gwas/releases/latest/gwas-efo-trait-mappings.tsv')
#with open(os.path.abspath(os.path.join('__file__', '../..', 'Data', 'Catalogue',
# 'Raw', 'Cat_Map.tsv')), 'wb') as tsvfile:
# tsvfile.write(r.content)
Let's link the PUBMEDID variable to the PubMed API and derive a series of datasets from it using the support functions written in PubMed.py. Note: 'collective' corresponds mostly to consortia and study groups.
Cat_Studies, Cat_Ancestry, Cat_Ancestry_groupedbyN, Cat_Full = load_gwas_cat()
id_list = set(Cat_Studies['PUBMEDID'].astype(str).tolist())
#Entrez.email = 'your.email@goeshere.com'
#papers = Entrez.read(Entrez.efetch(db='pubmed', retmode='xml', id=','.join(id_list)))
#build_collective(papers)
#build_author(papers)
#build_funder(papers)
#build_abstract(papers)
#build_citation(id_list, Entrez.email)
Cat_Studies, Cat_Ancestry, Cat_Ancestry_groupedbyN, Cat_Full = load_gwas_cat()
FunderInfo, AbstractInfo, AuthorMaster = load_pubmed_data()
print('There are currently: ' + str(len(Cat_Studies['PUBMEDID'].unique())) +
' GWAS papers published (in terms of unique PUBMEDIDs)')
print('The first observed GWAS in the catalogue was dated: ' +
str(Cat_Studies['DATE'].min()))
print('However, only: ' + str(len(Cat_Studies[Cat_Studies['DATE'].str.contains(
'2005|2006')])) + ' papers were published in 2005 and 2006 combined')
print('There are currently: ' +
str(len(Cat_Studies['STUDY ACCESSION'].unique())) +
' unique Study Accessions')
print('There are currently: ' +
str(len(Cat_Studies['DISEASE/TRAIT'].unique())) +
' unique Diseases/Traits studied')
print('These correspond to: ' +
str(len(Cat_Studies['MAPPED_TRAIT'].unique())) +
' unique EBI "Mapped Traits"')
print('The total number of Associations found is currently: ' +
str(round(Cat_Studies['ASSOCIATION COUNT'].sum(), 1)))
print('The average number of Associations is currently: ' +
str(round(Cat_Studies['ASSOCIATION COUNT'].mean(), 1)))
print('Mean P-Value for the strongest SNP risk allele is currently: ' +
str(round(Cat_Full['P-VALUE'].astype(float).mean(), 10)))
print('The number of associations reaching the 5e-8 threshold is ' +
      str(len(Cat_Full[Cat_Full['P-VALUE'].astype(float) < 5e-8])))
print('The journal to feature the most GWAS studies since ' +
str(Cat_Studies['DATE'].min()) +
' is: ' + Cat_Studies['JOURNAL'].mode()[0])
print('However, in 2017, ' + Cat_Studies[Cat_Studies['DATE'].str.contains(
'2017')]['JOURNAL'].mode()[0] +
' published the largest number of GWAS papers')
print('Largest Accession to date: ' +
str(Cat_Ancestry_groupedbyN.sort_values(by='N',
ascending=False)['N'].iloc[0]) +
' people and was published in ' +
str(Cat_Studies[
Cat_Studies['STUDY ACCESSION'] ==
Cat_Ancestry_groupedbyN.sort_values(by='N',
ascending=False)['STUDY ACCESSION'].iloc[0]]['JOURNAL'].iloc[0]) +
'. The first author was: ' +
str(Cat_Studies[
Cat_Studies['STUDY ACCESSION'] ==
Cat_Ancestry_groupedbyN.sort_values(by='N',
ascending=False)['STUDY ACCESSION'].iloc[0]]
['FIRST AUTHOR'].iloc[0]) + '.')
print('Total number of SNP-trait associations is ' +
str(Cat_Studies['ASSOCIATION COUNT'].sum()) + '.')
print('Total number of journals publishing GWAS is ' +
str(len(Cat_Studies['JOURNAL'].unique())) + '.')
print('The study with the largest number of authors has: ' +
str(AuthorMaster.groupby(['PUBMEDID'])['AUTHORNAME'].count().max()) +
' authors.')
Let's make a nice infographic and do some basic NLP on the abstracts from all PubMed IDs. This figure is the header of the readme.md:
FunderInfo, AbstractInfo, AuthorMaster = load_pubmed_data()
words_array = []
with open(os.path.abspath(
os.path.join('__file__', '../..', 'Data', 'PUBMED',
'Pubmed_AbstractCount.csv')),
'r', errors='replace') as csvfile:
reader = csv.DictReader(csvfile)
for row in reader:
if row['word'].lower() not in STOPWORDS:
if len(row['word']) > 3:
words_array.append(
(row['word'].upper(), float(row['num_words'])))
mask = Image.new('RGBA', (8000, 4467))
icon = Image.open(os.path.abspath(
os.path.join('__file__', '../..', 'Data', 'Support',
'doublehelix_mask.png'))).convert('RGBA')
mask.paste(icon, icon)
mask = np.array(mask)
wc = WordCloud(background_color='white', max_words=1250, mask=mask,
max_font_size=5000)
wc.generate_from_frequencies(dict(words_array))
wc.recolor(color_func=grey_color_func)
wc.to_file(os.path.abspath(
os.path.join('__file__', '../../', 'Figures', 'pdf',
'helix_wordcloud_1250_5000_black.pdf')))
plt.figure(figsize=(16, 8))
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()
The file Pubmed_AbstractCount.csv in the PUBMED subdirectory of Data details a breakdown of word counts across all abstracts. Check the counts for terms such as 'European', 'Asian', etc.
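As a minimal check of a few such terms (a sketch assuming the same 'word' and 'num_words' columns used above):

with open(os.path.abspath(
        os.path.join('__file__', '../..', 'Data', 'PUBMED',
                     'Pubmed_AbstractCount.csv')),
        'r', errors='replace') as csvfile:
    for row in csv.DictReader(csvfile):
        if row['word'].lower() in {'european', 'asian', 'african'}:
            print(row['word'] + ': ' + row['num_words'])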
We can also visualise the ever-increasing sample sizes and popularity of GWAS. The top panel shows the increasing number of GWAS conducted over time, with the fraction of 'big N' studies growing. The bottom-left subplot shows that with this growth comes a bigger N and a high correlation with the number of significant associations found. The bottom-right subplot shows that the unique number of journals publishing GWAS per year, as well as the unique number of diseases/traits studied, also steadily increases as the appeal of GWAS broadens.
plt.style.use('seaborn-ticks')
plt.rcParams['font.family'] = 'Times New Roman'
plt.rcParams['axes.linewidth'] = 0.75
Cat_Studies, Cat_Ancestry, Cat_Ancestry_groupedbyN, Cat_Full = load_gwas_cat()
yearlist = []
yearquarterlist = []
for year in range(2007, 2018):
yearlist.append(str(year))
for quarter in ['Q1', 'Q2', 'Q3', 'Q4']:
yearquarterlist.append(str(year) + quarter)
variables = ['N ≤ 5,000', '5,001 ≤ N ≤ 50,000', '50,001 ≤ N ≤ 100,000',
'100,001 ≤ N', 'N', 'Associations', 'Journals Printing GWAS',
'# Diseases Studied']
df_years, df_quarters = make_timely(variables, yearlist, yearquarterlist,
Cat_Studies, Cat_Ancestry,
Cat_Ancestry_groupedbyN)
plt.figure(figsize=(15, 10))
axA = plt.subplot(2, 1, 1)
ax0variables = ['N ≤ 5,000', '5,001 ≤ N ≤ 50,000',
'50,001 ≤ N ≤ 100,000', '100,001 ≤ N']
ax0 = df_quarters[ax0variables].plot(kind='bar', stacked=True, ax=axA,
color=['orangered', 'steelblue',
'goldenrod', 'forestgreen'],
alpha=0.75, edgecolor='k')
ax0.set_ylabel('Number of Study Accessions', fontsize=12)
ax0.tick_params(labelsize=10)
ax0.legend(fontsize=12, loc='upper left')
ax0.grid(b=True, which='major', color='#d3d3d3', linestyle='--', alpha=0.75)
axB = plt.subplot(2, 2, 3)
ax1a = df_years[['Associations']].plot(ax=axB, color='orangered', alpha=0.75,
rot=90, marker='o', linewidth=1.5,
markersize=8,
label='Associations Discovered',
markeredgecolor='k',
markeredgewidth=0.5)
ax1b = axB.twinx()
ax1b.plot(df_years[['N']], color='steelblue', marker='s', markersize=7,
linewidth=1.5, markeredgecolor='k', markeredgewidth=0.5)
ax1a.set_ylabel('Number of Associations Discovered', fontsize=12)
ax1b.set_ylabel('Number of Study Participants Analyzed', fontsize=12)
ax1b.grid(False)
axB.plot(0, 0, '-r', color='steelblue', marker='s', markersize=7,
markeredgecolor='k', markeredgewidth=0.5)
axB.legend(['Associations (left)', 'Participants (right)'],
fontsize=12, loc='upper left')
ax1a.tick_params(labelsize=10)
ax1b.tick_params(labelsize=10)
axB.grid(b=True, which='major', color='#d3d3d3', linestyle='--', alpha=0.75)
plt.axis('tight')
axC = plt.subplot(2, 2, 4)
axtest = axC.twinx()
ax_2a = df_years[['Journals Printing GWAS']].plot(kind='bar',
ax=axC, position=1,
color='orangered',
legend=False, width=0.35,
alpha=0.75, edgecolor='k')
ax_2b = df_years[['# Diseases Studied']].plot(kind='bar', ax=axtest,
position=0, color='steelblue',
width=0.35, legend=False,
alpha=0.75, edgecolor='k')
ax_2a.set_ylabel('Unique Number of Journals Publishing GWAS', fontsize=12)
ax_2b.set_ylabel('Unique Number of Diseases Studied', fontsize=12)
ax_2b.grid(False)
axC.plot(np.nan, 'orangered', linewidth=4)
axC.plot(np.nan, 'steelblue', linewidth=4)
axC.legend(['Journals (left)', 'Diseases (right)'],
fontsize=12, loc='upper left')
ax_2a.margins(1, 0.5)
ax_2a.tick_params(labelsize=10)
ax_2b.tick_params(labelsize=10)
axC.grid(b=True, which='major', color='#d3d3d3', linestyle='--', alpha=0.75)
plt.axis('tight')
plt.tick_params(axis='both', which='minor', labelsize=10)
plt.tight_layout()
plt.savefig(os.path.abspath(os.path.join('__file__', '../..',
'Figures', 'svg',
'GWAS_Popularity.svg')),
bbox_inches='tight')
This section of analysis only uses data contained within the GWAS Catalogue, but goes slightly deeper than Section 2 above and related papers in the literature. We take the 'Broad Ancestral Category' field and use it to aggregate from 135 combinations of 17 ancestries to 7 'broader ancestral categories'. To get a more detailed analysis, we load in some of our supporting functions to clean up the 'INITIAL SAMPLE SIZE' and 'REPLICATION SAMPLE SIZE' free-text fields in the catalogue and then plot something analogous to Popejoy and Fullerton. The supporting functions are based on regular expressions and data-wrangling techniques which exploit patterns in these free-text fields. By way of a very simple example: “19,546 British ancestry individuals from 6863 families.” will get cleaned to two separate fields, “19,546” and “British”, which can then be used for further downstream analysis. A slightly more complex example: “2,798 European ancestry individuals, 228 French Canadian founder population individuals” will correspond to two entries of 2798 and 228 in the new ‘Cleaned N’ type variable, corresponding to ‘European’ and ‘French Canadian’ in the ‘Cleaned Ancestry’ type variable respectively. Note that the figure only focuses on specific countries or ancestries which are mentioned, and ignores the very broad mentions of continental or aggregated ancestries such as "European", "African American", "Asian", etc. However, these can still be found in the lists below, decomposed into Initial and Replication (based on the 'INITIAL SAMPLE SIZE' and 'REPLICATION SAMPLE SIZE' free-text fields accordingly). We first tabulate and list the detailed ancestries in first the initial then the replication samples, then aggregate the two into a table, and then visualise it:
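To make the pattern-matching concrete, here is a minimal, illustrative sketch of the kind of extraction involved; the real ancestry_cleaner in Support/Ancestry.py is considerably more extensive, and this toy helper only handles the simple 'number + ancestry' pattern:

def sketch_ancestry_split(freetext):
    # split on commas that separate entries (not thousands separators),
    # then pull one (N, ancestry) pair out of each chunk
    pairs = []
    for chunk in re.split(r',\s(?=\d)', freetext):
        match = re.search(
            r'([\d,]+)\s+([A-Za-z ]+?)\s+(?:ancestry|founder|individuals)',
            chunk)
        if match:
            n = int(match.group(1).replace(',', ''))
            pairs.append((n, match.group(2).strip()))
    return pairs

# [(19546, 'British')] -- the 6863 families are correctly ignored
print(sketch_ancestry_split(
    '19,546 British ancestry individuals from 6863 families.'))
# [(2798, 'European'), (228, 'French Canadian')]
print(sketch_ancestry_split('2,798 European ancestry individuals, '
                            '228 French Canadian founder population individuals'))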
Cat_Studies, Cat_Ancestry, Cat_Ancestry_groupedbyN, Cat_Full = load_gwas_cat()
Cat_Ancestry['Broader Ancestral'] = Cat_Ancestry['Broader Ancestral'].str.replace("/","\\")
Cat_Studies['InitialClean'] = Cat_Studies.apply(
lambda row: ancestry_cleaner(row, 'INITIAL SAMPLE SIZE'), axis=1)
Cat_Studies['ReplicationClean'] = Cat_Studies.apply(
lambda row: ancestry_cleaner(row, 'REPLICATION SAMPLE SIZE'), axis=1)
def write_clean_sample(clean_col, out_name):
    # write one (STUDY ACCESSION, ancestry, N) row per parsed entry, but only
    # for studies where every ';'-delimited entry contains exactly one number
    with open(os.path.abspath(
            os.path.join('__file__', '../..', 'Data', 'Catalogue', 'Synthetic',
                         out_name)), 'w', encoding='utf-8') as fileout:
        fileout.write('STUDY ACCESSION,Cleaned Ancestry,Cleaned Ancestry Size\n')
        for index, row in Cat_Studies.iterrows():
            entries = row[clean_col].split(';')
            if sum(len(re.findall(r'\d+', e.strip())) == 1
                   for e in entries) == len(entries):
                for ancestry in entries:
                    number = re.findall(r'\d+', ancestry.strip())
                    words = ''.join(i for i in ancestry.strip()
                                    if not i.isdigit())
                    if (len(number) == 1) and (len(words.strip()) > 3) and \
                            (sum(1 for c in words if c.isupper()) > 0):
                        fileout.write(row['STUDY ACCESSION'] + ',' +
                                      words.strip() + ',' +
                                      str(number[0]) + '\n')

write_clean_sample('InitialClean', 'new_initial_sample.csv')
write_clean_sample('ReplicationClean', 'new_replication_sample.csv')
clean_initial = pd.read_csv(os.path.abspath(
    os.path.join('__file__', '../..', 'Data',
                 'Catalogue', 'Synthetic',
                 'new_initial_sample.csv')),
    encoding='utf-8')
clean_initial_sum = pd.DataFrame(
    clean_initial.groupby(['Cleaned Ancestry']).sum())
clean_initial_sum.rename(
    columns={'Cleaned Ancestry Size': 'Ancestry Sum', }, inplace=True)
clean_initial_count = clean_initial.groupby(['Cleaned Ancestry']).count()
clean_initial_count.rename(
columns={'Cleaned Ancestry Size': 'Ancestry Count', }, inplace=True)
clean_initial_merged = clean_initial_sum.merge(pd.DataFrame(
clean_initial_count['Ancestry Count']), how='outer', left_index=True,
right_index=True)
clean_initial_merged = clean_initial_merged.sort_values(
by='Ancestry Sum', ascending=False)
holder = ''
for index, row in clean_initial_merged.iterrows():
holder = holder + \
str(index) + ' (' + str(row['Ancestry Sum']) + \
',' + str(row['Ancestry Count']) + '), '
print('There are a total of ' + str(len(clean_initial_merged)) +
      ' ancestries found in the \'INITIAL SAMPLE SIZE\' column. ' +
      'These are (number of people used in all studies, ' +
      'number of studies included):\n\n' +
      holder[:-2] + '.\n\n')
clean_replication = pd.read_csv(os.path.abspath(
os.path.join('__file__', '../..', 'Data',
'Catalogue', 'Synthetic',
'new_replication_sample.csv')),
encoding='utf-8')
clean_replication_sum = pd.DataFrame(
clean_replication.groupby(['Cleaned Ancestry']).sum())
clean_replication_sum.rename(
columns={'Cleaned Ancestry Size': 'Ancestry Sum', }, inplace=True)
clean_replication_count = clean_replication.groupby(
['Cleaned Ancestry']).count()
clean_replication_count.rename(
columns={'Cleaned Ancestry Size': 'Ancestry Count', }, inplace=True)
clean_replication_merged = clean_replication_sum.merge(
pd.DataFrame(clean_replication_count['Ancestry Count']),
how='outer', left_index=True, right_index=True)
clean_replication_merged = clean_replication_merged.sort_values(
by='Ancestry Sum', ascending=False)
holder = ''
for index, row in clean_replication_merged.iterrows():
holder = holder + \
str(index) + ' (' + str(row['Ancestry Sum']) + \
',' + str(row['Ancestry Count']) + '), '
print('There are a total of ' + str(len(clean_replication_merged)) +
      ' ancestries found in the \'REPLICATION SAMPLE SIZE\' column.' +
      ' These are (number of people used in all studies, number of' +
      ' studies included):\n\n' + holder[:-2] + '.')
Let's put these numbers into tables to get a deeper understanding, first including rows in which at least one recorded ancestry is 'NR' (not recorded), and then dropping all such rows:
Broad_Ancestral_Full_initial = pd.DataFrame(
Cat_Ancestry[Cat_Ancestry.STAGE ==
'initial'].groupby(['Broader Ancestral'])['N'].sum())
Broad_Ancestral_Full_initial.rename(
columns={'N': 'N (Initial)'}, inplace=True)
Broad_Ancestral_Full_replication = pd.DataFrame(
Cat_Ancestry[Cat_Ancestry.STAGE ==
'replication'].groupby(['Broader Ancestral'])['N'].sum())
Broad_Ancestral_Full_replication.rename(
columns={'N': 'N (Replication)'}, inplace=True)
Broad_Ancestral_Full = pd.merge(Broad_Ancestral_Full_initial,
Broad_Ancestral_Full_replication,
left_index=True, right_index=True)
Broad_Ancestral_Full['Total'] = Broad_Ancestral_Full['N (Initial)'] + \
Broad_Ancestral_Full['N (Replication)']
Broad_Ancestral_Full['% Discovery'] = (
Broad_Ancestral_Full['N (Initial)'] /
Broad_Ancestral_Full['N (Initial)'].sum()) * 100
Broad_Ancestral_Full['% Replication'] = (
Broad_Ancestral_Full['N (Replication)'] /
Broad_Ancestral_Full['N (Replication)'].sum()) * 100
Broad_Ancestral_Full['% Total'] = (Broad_Ancestral_Full['Total'] /
Broad_Ancestral_Full['Total'].sum()) * 100
Broad_Ancestral_Full.to_csv(os.path.abspath(
os.path.join('__file__', '../..', 'Tables',
'Broad_Ancestral_Full.csv')))
Broad_Ancestral_Full.style
Drop the 'in part not recorded' rows (i.e. where the Broad Ancestral Category contains at least one NR):
Broad_Ancestral_NoNR = Broad_Ancestral_Full[['N (Initial)',
'N (Replication)',
'Total']]
Broad_Ancestral_NoNR = Broad_Ancestral_NoNR.drop('In Part Not Recorded')
Broad_Ancestral_NoNR['Total'] = Broad_Ancestral_NoNR['N (Initial)'] + \
Broad_Ancestral_NoNR['N (Replication)']
Broad_Ancestral_NoNR['% Discovery'] = (
Broad_Ancestral_NoNR['N (Initial)'] /
Broad_Ancestral_NoNR['N (Initial)'].sum()) * 100
Broad_Ancestral_NoNR['% Replication'] = (
Broad_Ancestral_NoNR['N (Replication)'] /
Broad_Ancestral_NoNR['N (Replication)'].sum()) * 100
Broad_Ancestral_NoNR['% Total'] = (Broad_Ancestral_NoNR['Total'] /
Broad_Ancestral_NoNR['Total'].sum()) * 100
Broad_Ancestral_NoNR.to_csv(os.path.abspath(
os.path.join('__file__', '../..', 'Tables',
'Broad_Ancestral_NoNR.csv')))
Broad_Ancestral_NoNR.style
Let's now make some bubble plots to visualise the raw broad ancestry field and the fields we extracted from the free-text strings:
plt.style.use('seaborn-ticks')
plt.rcParams["font.family"] = "Times New Roman"
GroupedBroadAncestries = Cat_Ancestry[
    (Cat_Ancestry['BROAD ANCESTRAL'] != 'Other') &
    (Cat_Ancestry['BROAD ANCESTRAL'].str.count(',') == 0)].groupby(
        ['BROAD ANCESTRAL'])['N'].sum().to_frame()
GroupedBroadAncestries['Percent Individuals'] = (
    GroupedBroadAncestries['N'] / GroupedBroadAncestries['N'].sum()) * 100
GroupedBroadAncestries = GroupedBroadAncestries.sort_values(
    by='Percent Individuals', ascending=False)[0:8]
# keys use literal backslashes to match the 'Broader Ancestral' recoding above
countriesdict = {'African': '#4daf4a', 'African Am.\\Caribbean': '#984ea3',
                 'Asian': '#e41a1c', 'European': '#377eb8',
                 'Hispanic\\Latin American': '#ff7f00',
                 'Other\\Mixed': '#a65628'}
fig = plt.figure(figsize=(12, 6), dpi=800)
ax = fig.add_subplot(1, 1, 1)
for obs in Cat_Ancestry.index:
for key, value in countriesdict.items():
if Cat_Ancestry['Broader Ancestral'][obs].strip() == key:
ax.plot_date(x=pd.to_datetime(Cat_Ancestry['Dates'][obs]),
y=Cat_Ancestry['N'][obs],
color=value, marker='.', label='the data',
alpha=0.6, markersize=Cat_Ancestry['N'][obs] / 7500)
ax.xaxis.set_major_locator(mdates.MonthLocator(interval=24))
ax.set_xlim(pd.Timestamp('2007-01-01'), pd.Timestamp('2017-12-31'))
ax.set_ylim(0, 500000)
ax.yaxis.tick_right()
ax.yaxis.set_label_position("right")
ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m'))
handles, labels = ax.get_legend_handles_labels()
inset_fig = inset_axes(ax, width=3.25,
height=1.5,loc=3,
bbox_to_anchor=(0.275, .6),
bbox_transform=ax.figure.transFigure)
barplot = Broad_Ancestral_NoNR[['% Discovery', '% Replication']].transpose()
barplot = barplot.plot(kind='bar', colors=['#4daf4a', '#984ea3',
                                           '#e41a1c', '#377eb8',
                                           '#ff7f00', '#a65628'],
                       alpha=0.6, edgecolor='k', ax=inset_fig, width=0.9)
barplot.set_yticklabels('')
barplot.set_ylabel('')
barplot.set_xlabel('')
barplot.set_ylim(0,120)
barplot.get_yaxis().set_visible(False)
for p in barplot.patches:
barplot.annotate(str(round(p.get_height(),2))+'%',
(p.get_x() + p.get_width() / 2., p.get_height()+20),
ha='center', va='center', rotation=90)
plt.legend(loc='center right', bbox_to_anchor=(0.025, 0.675),
           facecolor='white', edgecolor='black', frameon=True,
           prop={'size': 10})
sns.despine(top=True, left=True, right=False, ax=ax, offset=10)
sns.despine(top=True, right=True, left=True, offset=5, ax=inset_fig)
plt.savefig(os.path.abspath(os.path.join('__file__', '../..', 'Figures', 'svg',
                                         'Ancestral_Plots.svg')),
            bbox_inches='tight')
plt.savefig(os.path.abspath(os.path.join('__file__', '../..', 'Figures', 'png',
                                         'Ancestral_Plots.png')),
            bbox_inches='tight')
We can also analyze how these aggregates change over time, which was a key feature of Popejoy and Fullerton (2016). We can provide a much more granular argument (merely remove '_NoNR' from the cell below to undertake an equivalent analysis including rows which contain some in-part-NR ancestries):
index = [x for x in range(2007, 2018)]
columns = ['European', 'Asian', 'African', 'Hispanic\\Latin American',
           'Other\\Mixed', 'African Am.\\Caribbean']
Cat_Ancestry_NoNR = Cat_Ancestry[
Cat_Ancestry['Broader Ancestral'] != 'In Part Not Recorded']
Broad_Ancestral_Time_NoNR_PC = pd.DataFrame(index=index, columns=columns)
for year in range(2007, 2018):
for broaderancestry in Broad_Ancestral_NoNR.index.tolist():
Broad_Ancestral_Time_NoNR_PC[broaderancestry.strip()][year] =\
(Cat_Ancestry_NoNR[(
Cat_Ancestry_NoNR['DATE'].str.contains(str(year))) &
(Cat_Ancestry_NoNR['Broader Ancestral'] ==
broaderancestry)]['N'].sum() /
Cat_Ancestry_NoNR[
(Cat_Ancestry_NoNR['DATE'].str.contains(
str(year)))]['N'].sum()) * 100
Broad_Ancestral_Time_NoNR_PC.style
We can focus on the initial discovery stage to calculate the number of individuals required per ancestry to uncover one hit (i.e. N divided by the number of associations). Note, however, that this does require some distributional assumptions (i.e. that the same diseases are being studied across ancestries).
Cat_Ancestry_Initial = Cat_Ancestry[Cat_Ancestry['STAGE'] == 'initial']
Cat_Ancestry_NoDups = Cat_Ancestry_Initial.drop_duplicates(
subset=['STUDY ACCESSION'],
keep=False)[['PUBMEDID', 'STUDY ACCESSION',
'BROAD ANCESTRAL',
'N']]
Cat_Ancestry_NoDups_merge = pd.merge(Cat_Ancestry_NoDups,
Cat_Studies[['ASSOCIATION COUNT',
'STUDY ACCESSION',
'MAPPED_TRAIT']],
how='left', on='STUDY ACCESSION')
listtoiterate = ['European', 'East Asian', 'South Asian',
                 'African Am.\\Caribbean', 'Hispanic\\Latin American']
for ancestry in listtoiterate:
temp = Cat_Ancestry_NoDups_merge[Cat_Ancestry_NoDups_merge[
'BROAD ANCESTRAL'].str.strip() == ancestry]
print('The number of ' + ancestry + 's required to find one hit: ' +
str(round(1 /
(temp['ASSOCIATION COUNT'].sum() / temp['N'].sum()), 1)))
This is a choropleth map of country of recruitment - note that this field is not fully populated in the Catalogue. Basemap code is loosely based on this. As we want to study the number of people recruited from each country, we can only utilize values in the ‘Country of Recruitment’ field when only one country is mentioned. For example: if N=100,000 and Country of Recruitment = {U.K., U.S.}, there is no way for us to know from the Catalogue what the breakdown between the two countries is (especially as the free-text field may just contain 'European' ancestry). Our only option in this scenario is to drop such observations. This forces us to drop about 22% of all studies, which corresponds to about 48% of the Catalogue as measured by N (because bigger studies have multiple countries of recruitment). We are also faced with multiple entries equal to ‘NR’, which corresponds to ‘not recorded’. This reduces the number of studies going into the map by a further 21% (of the initial total), meaning the map only includes about half of all GWAS studies. This has no relationship to whether ancestry is coded.
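For concreteness, a minimal sketch of the filtering just described (the actual make_clean_CoR lives in Support/LoadData.py; the 'COUNTRY OF RECRUITMENT' column name follows the Catalog's ancestry file, and this sketch is illustrative rather than the exact implementation):

def sketch_clean_CoR(cat_ancestry):
    df = cat_ancestry.copy()
    # keep only rows naming exactly one country of recruitment
    df = df[df['COUNTRY OF RECRUITMENT'].notnull()]
    df = df[~df['COUNTRY OF RECRUITMENT'].str.contains(',')]
    # drop the 'NR' (not recorded) entries
    df = df[df['COUNTRY OF RECRUITMENT'].str.strip() != 'NR']
    return df.rename(columns={'COUNTRY OF RECRUITMENT': 'Cleaned Country'})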
Cat_Studies, Cat_Ancestry, Cat_Ancestry_groupedbyN, Cat_Full = load_gwas_cat()
Clean_CoR = make_clean_CoR(Cat_Ancestry)
Cleaned_CoR_N = pd.DataFrame(Clean_CoR.groupby(['Cleaned Country'])['N'].sum())
Cleaned_CoR_N['Rank_N'] = Cleaned_CoR_N.rank(
ascending=False, method='dense').astype(int)
Cleaned_CoR_count = pd.DataFrame(
Clean_CoR.groupby(['Cleaned Country'])['N'].count())
Cleaned_CoR_count['Rank_count'] = Cleaned_CoR_count.rank(
ascending=False, method='dense').astype(int)
Cleaned_CoR_count = Cleaned_CoR_count.rename(columns={'N': 'Count'})
Cleaned_CoR = pd.merge(Cleaned_CoR_count, Cleaned_CoR_N,
left_index=True, right_index=True)
del Cleaned_CoR.index.name
Cleaned_CoR['Percent of Count'] = (Cleaned_CoR['Count'] /
Cleaned_CoR['Count'].sum()) * 100
Cleaned_CoR['Percent of N'] = (Cleaned_CoR['N'] /
Cleaned_CoR['N'].sum()) * 100
Cleaned_CoR.sort_values('Rank_N', ascending=True)[
['Count', 'N', 'Percent of Count', 'Percent of N']].head(10)
bins = np.linspace(1, len(Cleaned_CoR['Rank_N'].unique()),
len(Cleaned_CoR['Rank_N'].unique()))
cm = plt.get_cmap('OrRd')
scheme = [cm(i / 10) for i in range(10)]
Cleaned_CoR['scheme'] = [
cm(Cleaned_CoR['N'][i] /
Cleaned_CoR['N'].max()) for i in Cleaned_CoR.index]
fig = plt.figure(figsize=(20, 10))
ax = fig.add_subplot(111, facecolor='#deeff5', frame_on=False)
scheme = [cm(i / len(Cleaned_CoR)) for i in range(len(Cleaned_CoR))]
m = Basemap(lon_0=0, projection='robin', resolution='i')
m.drawmapboundary(color='k', linewidth=0.075)
m.drawcountries(color='k', linewidth=0.025)
m.drawmeridians(np.arange(0, 360, 60), labels=[False, False, False, True],
color='#bdbdbd', dashes=[6, 6], linewidth=0.1, fontsize=10)
m.drawparallels(np.arange(-90, 90, 30), labels=[True, False, False, False],
color='#bdbdbd', dashes=[6, 6], linewidth=0.1, fontsize=10)
m.readshapefile(os.path.abspath(os.path.join('__file__', '../..', 'Data',
'ShapeFiles',
'ne_10m_admin_0_countries')),
'units', color='#444444', linewidth=.075)
for info, shape in zip(m.units_info, m.units):
country = info['NAME_CIAWF']
if country not in Cleaned_CoR.index:
color = 'w'
else:
color = Cleaned_CoR.loc[country]['scheme']
patches = [Polygon(np.array(shape), True)]
pc = PatchCollection(patches)
pc.set_facecolor(color)
ax.add_collection(pc)
norm = Normalize()
mapper = mpl.cm.ScalarMappable(norm=norm, cmap=cm)
mapper.set_array(Cleaned_CoR['N'] / 1000000)
clb = plt.colorbar(mapper, shrink=0.75)
clb.ax.tick_params(labelsize=10)
clb.ax.set_title('Number of\n People (m)', y=1.025, fontsize=12)
plt.savefig(os.path.abspath(os.path.join('__file__', '../..', 'Figures', 'svg',
'Country_Rec_N.svg')),
bbox_inches='tight')
plt.show()
Let's now merge that, via a country-to-continent lookup file, using data from within the shapefile itself (based on the CIAWF field):
countrylookup = pd.read_csv(os.path.abspath(
os.path.join('__file__', '../..', 'Data',
'ShapeFiles', 'Country_Lookup.csv')),
index_col='Country')
country_merged = pd.merge(countrylookup, Cleaned_CoR,
left_index=True, right_index=True)
country_merged_sumcount = country_merged[[
'Continent', 'Count']].groupby(['Continent']).sum()
country_merged_sumN = country_merged[[
'Continent', 'N']].groupby(['Continent']).sum()
country_merged_sums = pd.merge(
country_merged_sumN, country_merged_sumcount,
left_index=True, right_index=True)
country_merged_sums['N (%)'] = (
country_merged_sums['N'] / country_merged_sums['N'].sum()) * 100
country_merged_sums['Count (%)'] = (
country_merged_sums['Count'] / country_merged_sums['Count'].sum()) * 100
country_merged_sums.to_csv(os.path.abspath(
os.path.join('__file__', '../..', 'Tables',
'ContinentOfRecruitment.csv')))
country_merged_sums
Let's quickly see the effect of splitting versus dropping comma-delimited entries in the trait field:
Cat_Studies, Cat_Ancestry, Cat_Ancestry_groupedbyN, Cat_Full = load_gwas_cat()
EFO_Parent_Mapped, EFO_Parent_Mapped_NoCommas = EFO_parent_mapper(
Cat_Studies, Cat_Ancestry_groupedbyN)
print('There are ' + str(len(EFO_Parent_Mapped)) +
      ' rows of EFO terms after we split on commas')
print('This is ' + str(len(EFO_Parent_Mapped) -
                       len(EFO_Parent_Mapped_NoCommas)) +
      ' more terms mapped to Parents than when we drop' +
      ' comma-separated values')
print('There are ' +
      str(len(EFO_Parent_Mapped['EFO term'].drop_duplicates())) +
      ' unique EFO terms to map to Parents')
print('This is in comparison to ' +
      str(len(EFO_Parent_Mapped_NoCommas['EFO term'].drop_duplicates())) +
      ' unique EFO terms when comma-separated values are dropped')
Let's plot this out in terms of frequency of study, the number of people involved and the number of associations found:
plt.style.use('seaborn-ticks')
plt.rcParams['font.family'] = 'Times New Roman'
MappedTrait_df_Count = EFO_Parent_Mapped.groupby('EFO term')['EFO term'].count(
).to_frame().rename(columns={'EFO term': 'Number of Studies'}).reset_index()
MappedTrait_df_AncSum = EFO_Parent_Mapped.groupby('EFO term')['N'].sum(
).to_frame().rename(columns={'N': 'Total Sample'}).reset_index()
MappedTrait_df_AssSum = EFO_Parent_Mapped.groupby(
'EFO term')['ASSOCIATION COUNT'].sum().to_frame().rename(columns={
'ASSOCIATION COUNT': 'Total Associations'}).reset_index()
MappedTrait_df_toplot = pd.merge(
MappedTrait_df_AssSum, MappedTrait_df_AncSum, how='left', on='EFO term')
MappedTrait_df_toplot = pd.merge(
MappedTrait_df_toplot, MappedTrait_df_Count, how='left', on='EFO term')
MappedTrait_df_toplot['Studies'] = (
MappedTrait_df_toplot['Number of Studies'] /
MappedTrait_df_toplot['Number of Studies'].sum()) * 100
MappedTrait_df_toplot['Associations'] = (
MappedTrait_df_toplot['Total Associations'] /
MappedTrait_df_toplot['Total Associations'].sum()) * 100
MappedTrait_df_toplot['Sample'] = (
MappedTrait_df_toplot['Total Sample'] /
MappedTrait_df_toplot['Total Sample'].sum()) * 100
MappedTrait_df_toplot = MappedTrait_df_toplot.set_index('EFO term')
MappedTrait_df_toplot = MappedTrait_df_toplot.sort_values(
by='Studies', ascending=False)[0:17]
MappedTrait_df_toplot = MappedTrait_df_toplot[[
'Sample', 'Studies', 'Associations']]
EFOParent_df_Count = EFO_Parent_Mapped.groupby(
'Parent term')['Parent term'].count(
).to_frame().rename(columns={'Parent term': 'Number of Studies'}).reset_index()
EFOParent_df_AncSum = EFO_Parent_Mapped.groupby('Parent term')['N'].sum(
).to_frame().rename(columns={'N': 'Total Sample'}).reset_index()
EFOParent_df_AssSum = EFO_Parent_Mapped.groupby(
'Parent term')['ASSOCIATION COUNT'].sum().to_frame().rename(
columns={'ASSOCIATION COUNT': 'Total Associations'}).reset_index()
EFOParent_df_toplot = pd.merge(
EFOParent_df_AssSum, EFOParent_df_AncSum, how='left', on='Parent term')
EFOParent_df_toplot = pd.merge(
EFOParent_df_toplot, EFOParent_df_Count, how='left', on='Parent term')
EFOParent_df_toplot['Studies'] = (
EFOParent_df_toplot['Number of Studies'] /
EFOParent_df_toplot['Number of Studies'].sum()) * 100
EFOParent_df_toplot['Associations'] = (
EFOParent_df_toplot['Total Associations'] /
EFOParent_df_toplot['Total Associations'].sum()) * 100
EFOParent_df_toplot['Sample'] = (
EFOParent_df_toplot['Total Sample'] /
EFOParent_df_toplot['Total Sample'].sum()) * 100
EFOParent_df_toplot = EFOParent_df_toplot.set_index('Parent term')
EFOParent_df_toplot = EFOParent_df_toplot.sort_values(
by='Studies', ascending=False)
EFOParent_df_toplot = EFOParent_df_toplot[[
'Sample', 'Studies', 'Associations']]
fig, axes = plt.subplots(ncols=2, nrows=1, figsize=(18, 7.5))
fig_a = MappedTrait_df_toplot.sort_values(by='Studies', ascending=True).plot(
kind='barh', ax=axes[0], legend=False, colors=['goldenrod', 'steelblue',
'tomato'], edgecolor='k',
width=0.66)
fig_a.set_ylabel('')
fig_a.set_xlabel('Percent of Mapped Traits (%)', fontsize=12)
fig_b = EFOParent_df_toplot.sort_values(by='Studies', ascending=True).plot(
kind='barh', ax=axes[1], legend=True, colors=['goldenrod',
'steelblue', 'tomato'],
edgecolor='k', width=0.66)
fig_b.set_ylabel('')
fig_b.set_xlabel('Percent of Parent Traits (%)', fontsize=12)
fig_b.legend(prop={'size': 12}, frameon=True,
             loc='lower right', bbox_to_anchor=(1, 0.2), edgecolor='k')
plt.tight_layout()
sns.despine()
plt.savefig(os.path.abspath(os.path.join('__file__', '../..', 'Figures', 'svg',
'Diseases_Traits_EFO_Bars.svg')),
bbox_inches='tight')
Let's now decompose this across the seven different 'broader ancestral categories' we defined above:
Cat_Studies, Cat_Ancestry, Cat_Ancestry_groupedbyN, Cat_Full = load_gwas_cat()
EFO_Parent_Mapped, EFO_Parent_Mapped_NoCommas = EFO_parent_mapper(
    Cat_Studies, Cat_Ancestry_groupedbyN)
Parent_Mapped_broaderancestry = pd.merge(
    EFO_Parent_Mapped[['STUDY ACCESSION', 'Parent term']],
    Cat_Ancestry[['STUDY ACCESSION', 'Broader Ancestral', 'N']],
    how='left', on='STUDY ACCESSION')
Parent_broaderancestry_groupedby = Parent_Mapped_broaderancestry.groupby(
    ['Broader Ancestral', 'Parent term'], as_index=False)['N'].sum()
for ancestry in Parent_broaderancestry_groupedby['Broader Ancestral'].unique():
    holdstring = ('The studied Parent terms for the ' + ancestry +
                  ' category, in order, are: ')
    tempdf = Parent_broaderancestry_groupedby[
        Parent_broaderancestry_groupedby['Broader Ancestral'] ==
        ancestry].copy()
    tempdf['Percent of Total'] = (tempdf['N'] / tempdf['N'].sum()) * 100
    tempdf = tempdf.sort_values(by='N', ascending=False)
    for index, row in tempdf.iterrows():
        holdstring = (holdstring + row['Parent term'] + ' (' +
                      str(round(row['Percent of Total'], 2)) + '%) ; ')
    print(holdstring[:-2] + '\n')
EFO_Mapped_broaderancestry = pd.merge(
    EFO_Parent_Mapped[['STUDY ACCESSION', 'EFO term']],
    Cat_Ancestry[['STUDY ACCESSION', 'Broader Ancestral', 'N']],
    how='left', on='STUDY ACCESSION')
EFO_broaderancestry_groupedby = EFO_Mapped_broaderancestry.groupby(
    ['Broader Ancestral', 'EFO term'], as_index=False)['N'].sum()
for ancestry in EFO_broaderancestry_groupedby['Broader Ancestral'].unique():
    holdstring = ('The twenty most studied EFO terms for the ' + ancestry +
                  ' category, in order, are: ')
    tempdf = EFO_broaderancestry_groupedby[
        EFO_broaderancestry_groupedby['Broader Ancestral'] ==
        ancestry].copy()
    tempdf['Percent of Total'] = (tempdf['N'] / tempdf['N'].sum()) * 100
    tempdf = tempdf.sort_values(by='N', ascending=False)
    for index, row in tempdf[0:20].iterrows():
        holdstring = (holdstring + row['EFO term'] + ' (' +
                      str(round(row['Percent of Total'], 2)) + '%) ; ')
    print(holdstring[:-2] + '\n')
Let's now analyze who funds GWAS. We group the funder information returned from PubMed by agency and merge each agency's country back in:
FunderInfo, AbstractInfo, AuthorMaster = load_pubmed_data()
AllFunders = FunderInfo.groupby(by='Agency').count()
AllFunders.index.name = None
AllFunders = AllFunders.reset_index()
AllFunders = AllFunders.rename(columns={
'index': 'Agency',
'PUBMEDID': 'Grant Contributions',
'GrantCountry': 'Country'})
AllFunders_withcountries = pd.merge(AllFunders[['Agency',
'Grant Contributions']],
FunderInfo[['Agency',
'GrantCountry']].drop_duplicates(
'Agency'),
on='Agency', how='left')
AllFunders_withcountries = AllFunders_withcountries.set_index('Agency')
AllFunders_withcountries.index.name = None
AllFunders_withcountries['% of Total'] = round((
AllFunders_withcountries['Grant Contributions'] /
AllFunders_withcountries['Grant Contributions'].sum()) * 100, 2)
AllFunders_withcountries.sort_values(
'Grant Contributions', ascending=False)[0:20]
Let's print out some simple descriptives from these data:
print('There are ' + str(len(FunderInfo['Agency'].drop_duplicates())) +
' unique funders returned from PubMed Central.')
print('There are ' + str(len(FunderInfo['GrantID'].drop_duplicates())) +
' unique grants returned from PubMed Central.')
print('There are ' + str(len(FunderInfo['GrantCountry'].drop_duplicates())) +
' unique grant countries returned from PubMed Central.')
print('Each study has an average of ' +
str(round(len(FunderInfo) / len(id_list), 2)) + ' grants funding it.')
grantgroupby = FunderInfo.groupby(['Agency', 'GrantID']).size().groupby(
level=1).max().sort_values(ascending=False).reset_index()
print('The most frequently acknowledged grant is GrantID ' +
grantgroupby['GrantID'][0] + ' which is acknowledged ' +
str(grantgroupby[0][0]) + ' times.')
Which countries do these grants come from?
TopCountryFunders = FunderInfo.groupby(by='GrantCountry').count()
TopCountryFunders = TopCountryFunders.rename(
columns={'PUBMEDID': 'Number Of Studies'})
TopCountryFunders = TopCountryFunders.sort_values(
'Number Of Studies', ascending=False)[['Number Of Studies']]
TopCountryFunders = TopCountryFunders.reset_index().rename(
columns={'GrantCountry': 'Country of Funder'})
TopCountryFunders['Percent Funded (%)'] = (
TopCountryFunders['Number Of Studies'] /
TopCountryFunders['Number Of Studies'].sum()) * 100
TopCountryFunders = TopCountryFunders.set_index('Country of Funder')
TopCountryFunders.index.name = None
TopCountryFunders.style
FunderInfo, AbstractInfo, AuthorMaster = load_pubmed_data()
Cat_Studies, Cat_Ancestry, Cat_Ancestry_groupedbyN, Cat_Full = load_gwas_cat()
EFO_Parent_Mapped, EFO_Parent_Mapped_NoCommas = EFO_parent_mapper(
Cat_Studies, Cat_Ancestry_groupedbyN)
EFO_Parent_Mapped['Parent term'] = \
EFO_Parent_Mapped['Parent term'].str.replace('Disorder', '')
FunderInfo_Parent = pd.merge(FunderInfo, EFO_Parent_Mapped,
left_on='PUBMEDID', right_on='PUBMEDID',
how='left')
funder_parent = pd.DataFrame(
index=FunderInfo.groupby(
['Agency'])['Agency'].count().sort_values(ascending=False).
index.values[0:20].tolist(),
columns=FunderInfo_Parent.groupby(['Parent term'])['Parent term'].count().
sort_values(ascending=False).index.values.tolist())
for parent in FunderInfo_Parent.groupby(['Parent term'])['Parent term'].count().\
sort_values(ascending=False).index.values.tolist():
for funder in FunderInfo.groupby(['Agency'])['Agency'].count().sort_values(
ascending=False).index.values[0:20].tolist():
funder_parent[parent][funder] = len(
FunderInfo_Parent[(FunderInfo_Parent['Agency'] == funder) & (
FunderInfo_Parent['Parent term'] == parent)])
FunderInfo_Ancestry = pd.merge(FunderInfo, Cat_Ancestry, left_on='PUBMEDID',
right_on='PUBMEDID', how='left')
funder_ancestry = pd.DataFrame(index=FunderInfo.groupby(
['Agency'])['Agency'].count().sort_values(ascending=False).
index.values[0:20].tolist(),
columns=FunderInfo_Ancestry.groupby(
['Broader Ancestral'])['Broader Ancestral'].count().
sort_values(ascending=False).index.values.tolist())
for ancestry in FunderInfo_Ancestry.groupby(['Broader Ancestral'])['Broader Ancestral'].count().\
sort_values(ascending=False).index.values.tolist():
for funder in FunderInfo.groupby(['Agency'])['Agency'].count().sort_values(
ascending=False).index.values[0:20].tolist():
funder_ancestry[ancestry][funder] = len(
FunderInfo_Ancestry[(FunderInfo_Ancestry['Agency'] == funder) & (
FunderInfo_Ancestry['Broader Ancestral'] == ancestry)])
sns.set(font_scale=8, font='Times New Roman', style='white')
f, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, sharex=False,
sharey=True, figsize=(18, 11),
gridspec_kw={'width_ratios': [.25, .7],
'wspace': 0, 'hspace': 0})
gg = sns.heatmap(funder_ancestry.astype(float), ax=ax1, fmt='.0f',
annot=True, cmap='Oranges', xticklabels=True,
yticklabels=True, linewidth=2, robust=True, cbar=False,
annot_kws={'size': 10})
gg.tick_params(axis='both', which='major', labelsize=12)
hh = sns.heatmap(funder_parent.astype(float), ax=ax2, fmt='.0f',
annot=True, cmap='Blues', xticklabels=True,
yticklabels=True, linewidth=2, robust=True,
cbar=False, annot_kws={'size': 10})
hh.tick_params(axis='both', which='major', labelsize=12)
plt.gcf()
plt.setp(ax2.get_yticklabels(), visible=False)
plt.tight_layout()
plt.savefig(os.path.abspath(os.path.join('__file__', '../..', 'Figures', 'svg',
'Funder_Heatmap.svg')), bbox_inches='tight')
sns.set(font_scale=1, font='Times New Roman')
Who are the most cited authors? What is their 'GWAS H-Index', calculated based only on their papers in the GWAS Catalogue? This assumes unique forename + surname combinations identify unique authors. First, a couple of snippets:
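As a reminder of the definition applied below, a minimal, self-contained sketch of an h-index calculation (the largest h such that an author has h papers with at least h citations each):

def h_index(citations):
    # sort descending; h is the last rank at which citations >= rank
    h = 0
    for rank, cites in enumerate(sorted(citations, reverse=True), start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h

assert h_index([10, 8, 5, 4, 3]) == 4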
FunderInfo, AbstractInfo, AuthorMaster = load_pubmed_data()
print('There are a total of ' + str(len(AuthorMaster)) +
' "authorship contributions".')
print('These contributions are made by ' +
str(len(AuthorMaster['AUTHORNAME'].unique())) + ' unique authors.')
print('There are an average of ' +
      str(round(len(AuthorMaster) /
                len(AuthorMaster['PUBMEDID'].drop_duplicates()), 2)) +
      ' authorship contributions per paper.')
print('The study with the largest number of authors has ' +
str(AuthorMaster.groupby(['PUBMEDID']).size().max()) +
' authors (PubMed ID: ' +
str(AuthorMaster.groupby(['PUBMEDID']).size().idxmax()) + ')')
Let's then calculate our GWAS H-Indexes using citation data and paper counts for unique authors.
CitationCounts = pd.read_csv(os.path.abspath(
os.path.join('__file__', "../..",
"Data", "PUBMED", 'Pubmed_Cites.csv')))
AuthorMaster_withcites = pd.merge(
AuthorMaster, CitationCounts, on=['PUBMEDID'], how='left')
allauthors_formerge = pd.DataFrame(
AuthorMaster_withcites[['AUTHORNAME', 'citedByCount']])
allauthors_papercount = pd.DataFrame(
allauthors_formerge['AUTHORNAME'].value_counts())
allauthors_citecount = pd.DataFrame(
allauthors_formerge.groupby(by='AUTHORNAME')['citedByCount'].sum())
allauthors_merged = pd.merge(
allauthors_papercount, allauthors_citecount, left_index=True,
right_index=True)
allauthors_merged.columns = ['Papers', 'citedByCount']
allauthors_merged = allauthors_merged.sort_values(
by='citedByCount', ascending=False)
allauthors_merged['GWAS-H'] = np.NaN
counter = 0
for author in allauthors_merged.index:
counter += 1
temp = AuthorMaster_withcites[AuthorMaster_withcites['AUTHORNAME'] ==
author].sort_values(by='citedByCount',
ascending=False).dropna()
temp = temp.reset_index()
temp = temp.drop('index', 1)
    for pubnumber in range(0, len(temp)):
        # GWAS-H: the largest h such that h papers have at least h citations
        if pubnumber + 1 > temp['citedByCount'][pubnumber]:
            allauthors_merged.loc[author, 'GWAS-H'] = pubnumber
            break
    else:
        # no break: every paper has at least as many citations as its rank
        allauthors_merged.loc[author, 'GWAS-H'] = len(temp)
sys.stdout.write('\r' + 'Calculating GWAS H-indices: finished ' +
str(counter) + ' of ' +
str(len(allauthors_merged.index)) + ' authors...')
allauthors_citecount.reset_index(inplace=True)
allauthors_papercount.reset_index(inplace=True)
allauthors_papercount.rename(
columns={'AUTHORNAME': 'PAPERCOUNT', 'index': 'AUTHORNAME', },
inplace=True)
Let's next calculate some measures of authorship centrality with NetworkX. Note: this can take some time on slow machines. To get it to run faster, raise the citedByCount filter below to something like 100 (i.e. keep only authors with more than a hundred cumulative citations), or raise the PAPERCOUNT > 1 filter.
AuthorMaster_sumcites = pd.merge(AuthorMaster, allauthors_citecount,
left_on='AUTHORNAME', right_on='AUTHORNAME',
how='left')
AuthorMaster_sumcitespapercount = pd.merge(AuthorMaster_sumcites,
allauthors_papercount,
left_on='AUTHORNAME',
right_on='AUTHORNAME', how='left')
AuthorMaster_sumcitespapercount_filter_cites = AuthorMaster_sumcitespapercount[
AuthorMaster_sumcitespapercount['citedByCount'] > 10]
AuthorMaster_sumcitespapercount_filtered =\
AuthorMaster_sumcitespapercount_filter_cites[
AuthorMaster_sumcitespapercount_filter_cites['PAPERCOUNT'] > 1]
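# For a faster run, tighten the thresholds above, e.g. (value illustrative):
# AuthorMaster_sumcitespapercount_filter_cites = AuthorMaster_sumcitespapercount[
#     AuthorMaster_sumcitespapercount['citedByCount'] > 100]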
G = nx.Graph()
counter = 0
alledges = []
for paper in AuthorMaster_sumcitespapercount_filtered['PUBMEDID'].unique():
counter += 1
temp = AuthorMaster_sumcitespapercount_filtered[
AuthorMaster_sumcitespapercount_filtered['PUBMEDID'] == paper]
if len(temp) > 1:
templist = list(itertools.combinations(temp.AUTHORNAME, 2))
for edge in templist:
alledges.append(edge)
G.add_edges_from(list(set(alledges)))
print('\nThis gives us a network with ' + str(len(G)) +
      ' nodes (i.e. unique authors with more than one paper and more than ' +
      'ten cumulative cites).')
betcent = pd.Series(nx.betweenness_centrality(G), name='Betweenness')
allauthors_merged = allauthors_merged.merge(
betcent.to_frame(), left_index=True, right_index=True)
degcent = pd.Series(nx.degree_centrality(G), name='Degree')
allauthors_merged = allauthors_merged.merge(
degcent.to_frame(), left_index=True, right_index=True)
We can then merge all this data with data collected manually: from web searches, their country of employment and current employer, and from Web of Science searches, their total H-Index. We then rank in three different ways to analyze the overlap between the three metrics.
authorsupplemental = pd.read_csv(os.path.abspath(
os.path.join('__file__', "../..",
"Data", "Support", 'Author_Supplmentary.csv')),
index_col=0)
allauthors_merged_withsupp = pd.merge(allauthors_merged,
authorsupplemental,
left_index=True, right_index=True,
how='left',)
allauthors_merged_withsupp.sort_values(by='GWAS-H',
ascending=False).head(10).to_csv(
os.path.abspath(os.path.join('__file__',
"../..",
"Tables", "Authors.csv")))
allauthors_merged_withsupp.sort_values(by='GWAS-H', ascending=False).head(10)
allauthors_merged_withsupp.sort_values(by='Degree',ascending=False).head(10)
allauthors_merged_withsupp.sort_values(by='citedByCount',
ascending=False).head(10)
And then make a simple correlation matrix:
allauthors_merged_withsupp[['citedByCount', 'GWAS-H',
'Betweenness', 'Degree', 'Papers']].corr()
print('There are a total of ' +
str(len(allauthors_merged_withsupp)) + ' authors in the table.')
print('The person with the highest GWAS H-Index is: ' +
      allauthors_merged_withsupp['GWAS-H'].idxmax())
print('The person with the highest Degree is: ' +
allauthors_merged_withsupp['Degree'].idxmax())
print('The person with the highest citedByCount is: ' +
allauthors_merged_withsupp['citedByCount'].idxmax())
print('The person with the largest number of Papers is: ' +
      allauthors_merged_withsupp['Papers'].idxmax())
Let's consider the gender of each author by analyzing their forenames using the gender-guesser library. We can directly compare our results to this wonderful paper. One of the key assumptions made here is to lump all 'mostly' male and 'mostly' female names into their respective male/female categories.
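A quick illustration of what the detector returns (the names are illustrative, and outputs depend on the gender_guesser name data; possible returns are male, female, mostly_male, mostly_female, andy and unknown):

demo_detector = gender.Detector()
for demo_name in ['Melinda', 'Charles', 'Robin']:
    print(demo_name + ': ' + demo_detector.get_gender(demo_name))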
gendet = gender.Detector()
FunderInfo, AbstractInfo, AuthorMaster = load_pubmed_data()
Cat_Studies, Cat_Ancestry, Cat_Ancestry_groupedbyN, Cat_Full = load_gwas_cat()
AuthorCounts = AuthorMaster.groupby(['PUBMEDID'])['AUTHORNAME'].count(
).to_frame().reset_index().rename(columns={"AUTHORNAME": "Author Count"})
AuthorMaster = pd.merge(AuthorMaster, AuthorCounts, how='left', on='PUBMEDID')
AuthorMaster['CleanForename'] = AuthorMaster['FORENAME'].map(
lambda x: clean_names(x))
AuthorMaster['CleanGender'] = AuthorMaster['CleanForename'].map(
lambda x: gendet.get_gender(x))
AuthorMaster['MaleFemale'] = AuthorMaster['CleanGender'].str.replace('mostly_',
'')
AuthorMaster['isfemale'] = np.where(
AuthorMaster['MaleFemale'] == 'female', 1, 0)
AuthorFirst = AuthorMaster[AuthorMaster['Author Count']
> 4].drop_duplicates(subset='PUBMEDID', keep='first')
AuthorLast = AuthorMaster[AuthorMaster['Author Count']
> 4].drop_duplicates(subset='PUBMEDID', keep='last')
AuthorUnique = AuthorMaster.drop_duplicates(subset='AUTHORNAME')
for gender_ in AuthorMaster['CleanGender'].unique():
    print(str(round((len(AuthorMaster[AuthorMaster['CleanGender'] == gender_]) /
                     len(AuthorMaster['CleanGender'])) * 100, 2)) + '%' +
          ' ' + gender_ + ' authorship contributions in total')
    print(str(round((len(AuthorUnique[AuthorUnique['CleanGender'] == gender_]) /
                     len(AuthorUnique['CleanGender'])) * 100, 2)) + '%' +
          ' ' + gender_ + ' unique authors in total')
    print(str(round((len(AuthorFirst[AuthorFirst['CleanGender'] == gender_]) /
                     len(AuthorFirst['CleanGender'])) * 100, 2)) + '%' +
          ' ' + gender_ + ' first authors in total')
    print(str(round((len(AuthorLast[AuthorLast['CleanGender'] == gender_]) /
                     len(AuthorLast['CleanGender'])) * 100, 2)) + '%' +
          ' ' + gender_ + ' last authors in total')
print('\nPercent of male author contributions: ' +
str(round(len(AuthorMaster[AuthorMaster['MaleFemale'] == 'male']) /
(len(AuthorMaster[(AuthorMaster['MaleFemale'] == 'male') |
(AuthorMaster['MaleFemale'] == 'female')])) *
100, 3)) + '%')
print('Percent of unique male authors: ' +
str(round(len(AuthorUnique[AuthorUnique['MaleFemale'] == 'male']) /
(len(AuthorUnique[(AuthorUnique['MaleFemale'] == 'male') |
(AuthorUnique['MaleFemale'] == 'female')])) *
100, 3)) + '%')
print('Percent of male first authors: ' +
str(round(len(AuthorFirst[AuthorFirst['MaleFemale'] == 'male']) /
(len(AuthorFirst[(AuthorFirst['MaleFemale'] == 'male') |
(AuthorFirst['MaleFemale'] == 'female')])) *
100, 3)) + '%')
print('Percent of male last authors: ' +
str(round(len(AuthorLast[AuthorLast['MaleFemale'] == 'male']) /
(len(AuthorLast[(AuthorLast['MaleFemale'] == 'male') |
(AuthorLast['MaleFemale'] == 'female')])) *
100, 3)) + '%')
AuthorMaster_filtered = AuthorMaster[(AuthorMaster['MaleFemale'].str.contains(
'male')) & (AuthorMaster['Author Count'] > 4)]
AuthorMaster_filtered_merged_bydisease = pd.merge(
AuthorMaster_filtered, Cat_Studies[['PUBMEDID', 'DISEASE/TRAIT']],
how='left', on='PUBMEDID')
AuthorMaster_filtered_merged_bymappedtrait = pd.merge(
AuthorMaster_filtered, Cat_Studies[['PUBMEDID', 'MAPPED_TRAIT']],
how='left', on='PUBMEDID')
meanfemales_bydisease = AuthorMaster_filtered_merged_bydisease.groupby(
['DISEASE/TRAIT'])['isfemale'].mean().to_frame()
numberofstudies_bydisease = Cat_Studies.groupby(
['DISEASE/TRAIT'])['DISEASE/TRAIT'].count().to_frame()
mergedgender_andcount_bydisease = pd.merge(
numberofstudies_bydisease, meanfemales_bydisease, left_index=True,
right_index=True)
mergedgender_andcount_bydisease = mergedgender_andcount_bydisease.sort_values(
by='DISEASE/TRAIT', ascending=False)[0:50]
holdstring = 'Percent of female authorships across the 50 most commonly \
studied diseases (from the raw Cat_Studies file merged with PUBMEDID and \
guessed gender): '
for index, row in mergedgender_andcount_bydisease.iterrows():
    holdstring = holdstring + \
        index.title() + ' (' + str(round(row['isfemale'] * 100, 1)) + '%), '
print('\n' + holdstring[:-2])
Cat_Studies, Cat_Ancestry, Cat_Ancestry_groupedbyN, Cat_Full = load_gwas_cat()
EFO_Parent_Mapped, EFO_Parent_Mapped_NoCommas = EFO_parent_mapper(
Cat_Studies, Cat_Ancestry_groupedbyN)
AuthorMaster_merged = pd.merge(AuthorMaster[(
AuthorMaster['MaleFemale'] == 'male') |
(AuthorMaster['MaleFemale'] == 'female')],
EFO_Parent_Mapped, how='left', on='PUBMEDID')
AuthorMaster_EFO = AuthorMaster_merged.groupby(
['EFO term'])['isfemale'].mean().to_frame()
AuthorMaster_Parent = AuthorMaster_merged.groupby(
['Parent term'])['isfemale'].mean().to_frame()
countstudiesperEFO = EFO_Parent_Mapped.groupby(['EFO term'])['PUBMEDID'].count(
).sort_values(ascending=False)[0:17].index.tolist()
meanfemalestudyaccession = AuthorMaster_merged.groupby(
['STUDY ACCESSION'])['isfemale'].mean().to_frame().reset_index()
meanfemalestudyaccession_withparent = pd.merge(EFO_Parent_Mapped[[
'STUDY ACCESSION', 'Parent term']], meanfemalestudyaccession,
on='STUDY ACCESSION', how='left')
meanfemalestudyaccession_withparent = meanfemalestudyaccession_withparent[
~meanfemalestudyaccession_withparent['isfemale'].isnull()]
meanfemalestudyaccession_withEFO = pd.merge(EFO_Parent_Mapped[[
'STUDY ACCESSION', 'EFO term']], meanfemalestudyaccession,
on='STUDY ACCESSION', how='left')
meanfemalestudyaccession_withEFO = meanfemalestudyaccession_withEFO[
meanfemalestudyaccession_withEFO['EFO term'].isin(
countstudiesperEFO)]
meanfemalestudyaccession_withEFO = meanfemalestudyaccession_withEFO[
~meanfemalestudyaccession_withEFO['isfemale'].isnull()]
AuthorMaster_EFO.reset_index(inplace=True)
ranks = AuthorMaster_EFO[AuthorMaster_EFO['EFO term'].isin(
countstudiesperEFO)].sort_values(by='isfemale',
ascending=False)['EFO term'].tolist()
sns.set(font_scale=1.2, font="Times New Roman", style='ticks')
fig = plt.figure(figsize=(12, 6), dpi=800)
ax1 = fig.add_subplot(1, 1, 1)
p = sns.boxplot(data=meanfemalestudyaccession_withEFO,
y='EFO term', ax=ax1, order=ranks,
x='isfemale', whis=1.5, palette="vlag", fliersize=0)
x = sns.swarmplot(y='EFO term', x='isfemale',
data=meanfemalestudyaccession_withEFO, order=ranks,
size=2.25, color="white", edgecolor="black", linewidth=0.5,
ax=ax1)
p.set_ylabel('')
p.set_xlim(0, 1.05)
p.set(axis_bgcolor='white')
p.set_xlabel('Female/Male Ratio Per Paper Across Mapped Terms')
plt.setp(p.spines.values(), color='k')
plt.setp([p.get_xticklines(), p.get_yticklines()], color='k')
sns.despine(bottom=False, left=False, offset=10, ax=ax1)
plt.tight_layout()
plt.savefig(os.path.abspath(os.path.join('__file__', "../..", "Figures", "svg",
'Gender_by_Subjects.svg')),
bbox_inches='tight')
sns.set(font_scale=1, font="Times New Roman")
AuthorMaster_Parent.reset_index(inplace=True)
holdstring = 'The percent of female authorships across Parent terms: '
for index, row in AuthorMaster_Parent.iterrows():
    holdstring = holdstring + row['Parent term'].title() + ' (' \
        + str(round(row['isfemale'] * 100, 1)) + '%), '
print('\n' + holdstring[:-2])
holdstring = 'The percent of female authorships across the EFO terms in the ' \
    'figure above: '
for index, row in AuthorMaster_EFO[
        AuthorMaster_EFO['EFO term'].isin(countstudiesperEFO)].sort_values(
        by='isfemale', ascending=False).iterrows():
    holdstring = holdstring + \
        row['EFO term'].title() + ' (' + \
        str(round(row['isfemale'] * 100, 1)) + '%), '
print('\n' + holdstring[:-2])
Let's use the PubMed database returns on collectives to analyze which consortia and study groups feature:
pubmed_collectives = pd.read_csv(os.path.abspath(
os.path.join('__file__', '../..',
'Data', 'PUBMED',
'Pubmed_CollectiveInfo.csv')),
encoding='latin-1')
collect_dict = pd.read_csv(os.path.abspath(
os.path.join(
'__file__', '../..', 'Data',
'Support', 'Collectives',
'Collective_Dictionary.csv')),
encoding='latin-1')
pubmed_collectives['COLLECTIVE'] = pubmed_collectives['COLLECTIVE'].str.strip()
pubmed_collectives['COLLECTIVE'] = pubmed_collectives['COLLECTIVE'].str.lower()
collect_dict['COLLECTIVE'] = collect_dict['COLLECTIVE'].str.strip(
).str.lower()
merged_collect = pd.merge(
pubmed_collectives, collect_dict, how='left', on='COLLECTIVE')
if len(merged_collect[merged_collect['Clean_Name'].isnull()]) > 0:
print('Danger! Correct collectives in Collective_Unverified.csv ' +
'and add to the dictionary\n')
unverified_collect = merged_collect[merged_collect['Clean_Name'].isnull(
)]
unverified_collect.to_csv(os.path.abspath(
os.path.join('__file__', '../..',
'Data', 'Support', 'Collectives',
'Collective_Unverified.csv')))
else:
print('The consortium dictionary is up to date!\n')
consortiumlist = []
for index, row in merged_collect.iterrows():
if pd.isnull(row['Clean_Name']) is False:
for consortium in row['Clean_Name'].split(';'):
consortiumlist.append(consortium)
clean_collectives = pd.DataFrame(consortiumlist, columns=['Consortium'])
groupby_collectives = clean_collectives.groupby(
['Consortium'])['Consortium'].count().sort_values(ascending=False)
holdstring = 'The most frequently seen consortia are: '
for index, row in groupby_collectives.to_frame()[0:50].iterrows():
holdstring = holdstring + index + ' (' + str(row['Consortium']) + '), '
print(holdstring[:-2] + '.')
print('\nWe have seen a total of ' +
str(len(groupby_collectives.to_frame())) + ' different collectives!')
print('A total of ' + str(len(pubmed_collectives['PUBMEDID'].drop_duplicates(
))) + ' papers are contributed to by at least one collective.')
print('A total of ' + str(groupby_collectives.sum()) +
' collective contributions are made.')
Let's now wrangle together some of the manually collated data. First, let's order and consider the most frequently observed cohorts, with the aim of analyzing the resampling issue discussed earlier in the literature:
finaldataset = pd.read_csv(os.path.abspath(
os.path.join('__file__', '../..', 'Data',
'Support', 'Cohorts',
'GWASCat1to1000 final.csv')),
encoding='latin-1',
index_col='Number')
mylist = []
for index, row in finaldataset.iterrows():
mylist = mylist + row['DATASETS'].split(';')
mylist = [x.strip(' ') for x in mylist]
df1 = pd.DataFrame({'Cohorts': mylist})
with open(os.path.abspath(os.path.join(
        '__file__', '../..', 'Data', 'Support', 'Cohorts',
        'Dictionary_cohorts.csv')), 'r') as dictfile:
    d = {k: v for k, v in csv.reader(dictfile)}
df1['Cohorts'] = df1['Cohorts'].replace(d)
# use Series.replace (exact, non-regex) so the parentheses match literally
df1['Cohorts'] = df1['Cohorts'].replace(
    'Rotterdam Study I (RS-I)', 'Rotterdam Study (RS)')
newdf = pd.DataFrame({'Count': df1.groupby(['Cohorts']).size()}).reset_index()
Then merge this with the manually collated data:
manual_cohort_for_merge = pd.read_csv(os.path.abspath(
os.path.join('__file__', "../..", "Data",
"Support", "Cohorts",
"manual_cohort_for_merge.csv")),
encoding='latin-1', index_col=False)
merged_manual = pd.merge(newdf, manual_cohort_for_merge,
how='left', on='Cohorts')
merged_manual.set_index('Cohorts').sort_values(
by='Count', ascending=False).drop('NO NAME').head(10)
print('We guess there are a total of ' + str(len(newdf)) +
' different cohorts used in the 1000 biggest studies.')
Let's now print out a list of the 100 most commonly used datasets in case we need to reference any particularly prominent ones (e.g. deCODE, UK Biobank, etc.):
merged_manual = merged_manual.set_index('Cohorts').sort_values(
    by='Count', ascending=False).drop('NO NAME')[['Count']]
holder = 'The hundred most commonly used datasets are: '
for index, row in merged_manual[0:100].iterrows():
holder = holder + \
str(index) + ' (' + str(row['Count']) + ' times), '
print(holder[:-2])
As a bit of robustness, let's compare results which drop comma-delimited traits in the raw catalogue files:
#from Support.Robustness import traits_robustness
#from Support.Robustness import gender_robustness
#from Support.Robustness import funder_robustness
#Cat_Studies, Cat_Ancestry, Cat_Ancestry_groupedbyN, Cat_Full = load_gwas_cat()
#FunderInfo, AbstractInfo, AuthorMaster = load_pubmed_data()
#EFO_Parent_Mapped, EFO_Parent_Mapped_NoCommas = EFO_parent_mapper(Cat_Studies, Cat_Ancestry_groupedbyN)
#traits_robustness(EFO_Parent_Mapped_NoCommas)
#gender_robustness(AuthorMaster, EFO_Parent_Mapped_NoCommas)
#funder_robustness(EFO_Parent_Mapped_NoCommas, FunderInfo, Cat_Ancestry)